{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Use SageMaker Distributed Model Parallel with Amazon SageMaker to Launch an MNIST Training Job with Model Parallelization\n",
    "\n",
    "SageMaker Distributed Model Parallel (SMP) is a model parallelism library for training large deep learning models that were previously difficult to train due to GPU memory limitations. SageMaker Distributed Model Parallel automatically and efficiently splits a model across multiple GPUs and instances and coordinates model training, allowing you to increase prediction accuracy by creating larger models with more parameters.\n",
    "\n",
    "Use this notebook to configure Sagemaker Distributed Model Parallel to train a model using an example PyTorch training script, `utils/pt_mnist.py` and [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/overview.html#train-a-model-with-the-sagemaker-python-sdk). \n",
    "\n",
    "\n",
    "### Additional Resources\n",
    "\n",
    "If you are a new user of Amazon SageMaker, you may find the following helpful to learn more about SMP and using SageMaker with Pytorch. \n",
    "\n",
    "* To learn more about the SageMaker model parallelism library, see [Model Parallel Distributed Training with SageMaker Distributed](http://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel.html).\n",
    "\n",
    "* To learn more about using the SageMaker Python SDK with Pytorch, see [Using PyTorch with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html).\n",
    "\n",
    "* To learn more about launching a training job in Amazon SageMaker with your own training image, see [Use Your Own Training Algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Amazon SageMaker Initialization\n",
    "\n",
    "Run the following cell to initialize the notebook instance. Get the SageMaker execution role used to run this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "import sagemaker\n",
    "from sagemaker import get_execution_role\n",
    "from sagemaker.pytorch import PyTorch\n",
    "from smexperiments.experiment import Experiment\n",
    "from smexperiments.trial import Trial\n",
    "import boto3\n",
    "from time import gmtime, strftime\n",
    "\n",
    "role = get_execution_role() # provide a pre-existing role ARN as an alternative to creating a new role\n",
    "print(f'SageMaker Execution Role:{role}')\n",
    "\n",
    "session = boto3.session.Session()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prepare your training script\n",
    "\n",
    "Run the following cell to view an example-training script you will use in this demo. This is a PyTorch 1.6 trianing script that uses the MNIST dataset. \n",
    "\n",
    "You will see that the script contains `SMP` specific operations and decorators, which configure model parallel training. See the training script comments to learn more about the SMP functions and types used in the script."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Run this cell to see an example of a training scripts that you can use to configure -\n",
    "# SageMaker Distributed Model Parallel with PyTorch version 1.6\n",
    "!cat utils/pt_mnist.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define SageMaker Training Job\n",
    "\n",
    "Next, you will use SageMaker Estimator API to define a SageMaker Training Job. You will use an [`Estimator`](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html) to define the number and type of EC2 instances Amazon SageMaker uses for training, as well as the size of the volume attached to those instances. \n",
    "\n",
    "You can update the following:\n",
    "* `processes_per_host`\n",
    "* `entry_point`\n",
    "* `instance_count`\n",
    "* `instance_type`\n",
    "* `base_job_name`\n",
    "\n",
    "In addition, you can supply and modify configuration parameters for the SageMaker Distributed Model Parallel library. These parameters will be passed in through the `distributions` argument, as shown below.\n",
    "\n",
    "### Update the Type and Number of EC2 Instances Used\n",
    "\n",
    "Specify your `processes_per_host`. Note that it must be a multiple of your partitions, which by default is 2.\n",
    "\n",
    "The instance type and number of instances you specify in `instance_type` and `instance_count` respectively will determine the number of GPUs Amazon SageMaker uses during training. Explicitly, `instance_type` will determine the number of GPUs on a single instance and that number will be multiplied by `instance_count`. \n",
    "\n",
    "You must specify values for `instance_type` and `instance_count` so that the total number of GPUs available for training is equal to `partitions` in `config` of `smp.init` in your training script. \n",
    "\n",
    "\n",
    "To look up instances types, see [Amazon EC2 Instance Types](https://aws.amazon.com/sagemaker/pricing/).\n",
    "\n",
    "\n",
    "### Uploading Checkpoint During Training or Resuming Checkpoint from Previous Training\n",
    "We also provide a custom way for users to upload checkpoints during training or resume checkpoints from previous training. We have integrated this into our `pt_mnist.py` example script. Please see the functions `aws_s3_sync`, `sync_local_checkpoints_to_s3`, and `sync_s3_checkpoints_to_local`. For the purpose of this example, we are only uploading a checkpoint during training, by using `sync_local_checkpoints_to_s3`. \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After you have updated `entry_point`, `instance_count`, `instance_type` and `base_job_name`, run the following to create an estimator. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sagemaker_session = sagemaker.session.Session(boto_session=session)\n",
    "mpioptions = \"-verbose -x orte_base_help_aggregate=0 \"\n",
    "mpioptions += \"--mca btl_vader_single_copy_mechanism none \"\n",
    "\n",
    "all_experiment_names = [exp.experiment_name for exp in Experiment.list()]\n",
    "\n",
    "#choose an experiment name (only need to create it once)\n",
    "experiment_name = \"SM-MP-DEMO\"\n",
    "\n",
    "# Load the experiment if it exists, otherwise create \n",
    "if experiment_name not in all_experiment_names:\n",
    "    customer_churn_experiment = Experiment.create(\n",
    "        experiment_name=experiment_name, sagemaker_boto_client=boto3.client(\"sagemaker\")\n",
    "    )\n",
    "else:\n",
    "    customer_churn_experiment = Experiment.load(\n",
    "        experiment_name=experiment_name, sagemaker_boto_client=boto3.client(\"sagemaker\")\n",
    "    )\n",
    "\n",
    "# Create a trial for the current run\n",
    "trial = Trial.create(\n",
    "        trial_name=\"SMD-MP-demo-{}\".format(strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())),\n",
    "        experiment_name=customer_churn_experiment.experiment_name,\n",
    "        sagemaker_boto_client=boto3.client(\"sagemaker\"),\n",
    "    )\n",
    "\n",
    "\n",
    "smd_mp_estimator = PyTorch(\n",
    "          entry_point=\"pt_mnist.py\", # Pick your train script\n",
    "          source_dir='utils',\n",
    "          role=role,\n",
    "          instance_type='ml.p3.16xlarge',\n",
    "          sagemaker_session=sagemaker_session,\n",
    "          framework_version='1.6.0',\n",
    "          py_version='py3',\n",
    "          instance_count=1,\n",
    "          distribution={\n",
    "              \"smdistributed\": {\n",
    "                  \"modelparallel\": {\n",
    "                      \"enabled\":True,\n",
    "                      \"parameters\": {\n",
    "                          \"microbatches\": 4,\n",
    "                          \"placement_strategy\": \"spread\",\n",
    "                          \"pipeline\": \"interleaved\",\n",
    "                          \"optimize\": \"speed\",\n",
    "                          \"partitions\": 2,\n",
    "                          \"ddp\": True,\n",
    "                      }\n",
    "                  }\n",
    "              },\n",
    "              \"mpi\": {\n",
    "                    \"enabled\": True,\n",
    "                    \"processes_per_host\": 2, # Pick your processes_per_host\n",
    "                    \"custom_mpi_options\": mpioptions \n",
    "              },\n",
    "          },\n",
    "          base_job_name=\"SMD-MP-demo\",\n",
    "      )\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, you will use the estimator to launch the SageMaker training job."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "smd_mp_estimator.fit(\n",
    "        experiment_config={\n",
    "            \"ExperimentName\": customer_churn_experiment.experiment_name,\n",
    "            \"TrialName\": trial.trial_name,\n",
    "            \"TrialComponentDisplayName\": \"Training\",\n",
    "        })"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
